Kernel name: “NYC Taxi EDA - Update: The fast & the curious”
Link: https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious
In this project we are going to explore the kernel above showing why the author choose the functions used, comparing to others alternatives and we are going to emphasize the interesting points of the kernel.
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('alluvial') # visualisation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('lubridate') # date and time
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps
library('xgboost') # modelling
library('caret') # modelling
library('profvis') # code execution time anaylsis
library("data.table")
library("tibble")
train <- as.tibble(fread('train.csv'))
In R, read.csv is part of the regular functions and is used for load data.frame from a csv file. But when we’re dealing with a huge data.frame this function can take a long time to run.
print(paste("In this case the dataset is quite huge:",dim(train)[1], "rows and",
dim(train)[2], "columns."))
## [1] "In this case the dataset is quite huge: 1458644 rows and 11 columns."
So in this part the author used a function called fread that performs much faster than read.csv (check the time of each function using profvis!!).
After that other function should be compared: load. This function is used to load variables that have been stored in a .RData file and runs very fast comparing with read.csv and fread.
When is a good ideia to use load? When it’s possible to use a background process to update the data.frame and save it in .RData file.
Let’s take a look at the three possibilities:
library("profvis")
library("data.table")
library("tibble")
library("readr")
profvis({
# fread
train <- as.tibble(fread("train.csv"))
test <- as.tibble(fread("test.csv"))
# read.csv
train_readcsv <- read.csv("train.csv")
# read_csv -> from "readr" package
train_read_csv <- read_csv("train.csv")
# loading RData
save(train_readcsv, file = "train_data.RData")
rm(train_readcsv)
load(file = "train_data.RData")
})
All the information bellow was “greped” from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
Tibbles
“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).”
Major points:
A brief overview of our data can summaries the descriptive statistics values of the dataset and detect abnormal items or outliers.
For the summaries
summary(train)
## id vendor_id pickup_datetime dropoff_datetime
## Length:1458644 Min. :1.000 Length:1458644 Length:1458644
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.535
## 3rd Qu.:2.000
## Max. :2.000
## passenger_count pickup_longitude pickup_latitude dropoff_longitude
## Min. :0.000 Min. :-121.93 Min. :34.36 Min. :-121.93
## 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99
## Median :1.000 Median : -73.98 Median :40.75 Median : -73.98
## Mean :1.665 Mean : -73.97 Mean :40.75 Mean : -73.97
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96
## Max. :9.000 Max. : -61.34 Max. :51.88 Max. : -61.34
## dropoff_latitude store_and_fwd_flag trip_duration
## Min. :32.18 Length:1458644 Min. : 1
## 1st Qu.:40.74 Class :character 1st Qu.: 397
## Median :40.75 Mode :character Median : 662
## Mean :40.75 Mean : 959
## 3rd Qu.:40.77 3rd Qu.: 1075
## Max. :43.92 Max. :3526282
summary(test)
## id vendor_id pickup_datetime passenger_count
## Length:625134 Min. :1.000 Length:625134 Min. :0.000
## Class :character 1st Qu.:1.000 Class :character 1st Qu.:1.000
## Mode :character Median :2.000 Mode :character Median :1.000
## Mean :1.535 Mean :1.662
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :9.000
## pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## Min. :-121.93 Min. :37.39 Min. :-121.93 Min. :36.60
## 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99 1st Qu.:40.74
## Median : -73.98 Median :40.75 Median : -73.98 Median :40.75
## Mean : -73.97 Mean :40.75 Mean : -73.97 Mean :40.75
## 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96 3rd Qu.:40.77
## Max. : -69.25 Max. :42.81 Max. : -67.50 Max. :48.86
## store_and_fwd_flag
## Length:625134
## Class :character
## Mode :character
##
##
##
Data overview
library("dplyr")
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
glimpse(test)
## Observations: 625,134
## Variables: 9
## $ id <chr> "id3004672", "id3505355", "id1217141", "id2...
## $ vendor_id <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1...
## $ pickup_datetime <chr> "2016-06-30 23:59:58", "2016-06-30 23:59:53...
## $ passenger_count <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 1, 1...
## $ pickup_longitude <dbl> -73.98813, -73.96420, -73.99744, -73.95607,...
## $ pickup_latitude <dbl> 40.73203, 40.67999, 40.73758, 40.77190, 40....
## $ dropoff_longitude <dbl> -73.99017, -73.95981, -73.98616, -73.98643,...
## $ dropoff_latitude <dbl> 40.75668, 40.65540, 40.72952, 40.73047, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
Another popular way to make a data overview is using str. It is very similar to glimpse but str shows less data.
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1458644 obs. of 11 variables:
## $ id : chr "id2875421" "id2377394" "id3858529" "id3504673" ...
## $ vendor_id : int 2 1 2 2 2 2 1 2 1 2 ...
## $ pickup_datetime : chr "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
## $ dropoff_datetime : chr "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
## $ passenger_count : int 1 1 1 1 1 6 4 1 1 1 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.7 40.8 40.7 40.8 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.7 40.7 40.8 ...
## $ store_and_fwd_flag: chr "N" "N" "N" "N" ...
## $ trip_duration : int 455 663 2124 429 435 443 341 1551 255 1225 ...
## - attr(*, ".internal.selfref")=<externalptr>
levels(as.factor(train$vendor_id))
## [1] "1" "2"
To avoid an inappropriate analysis of the data, the missing values should be analysed to measure their impact in the whole dataset.
If the number of cases is less than 5% of the sample, then the researcher can drop them.
For more info about this subject: https://www.statisticssolutions.com/missing-values-in-data/
Luckly there is no missing values in data (easy mode):
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
Here the author did an interesting move: he combined train and test data sets into a single one in order to avoid a closely approach that matches just one part of data.
CAUTION: we can only combine the two data sets for a better overview but for the creation of a machine learning model we should keep train and test separate.
# Mutate creates dset, dropff_datetime and trip_duration columns for test dataset
# For train dataset only dset is created by mutate
# bind_rows combines the data sets into one
combine <- bind_rows(train %>% mutate(dset = "train"),
test %>% mutate(dset = "test",
dropoff_datetime = NA,
trip_duration = NA))
combine <- combine %>% mutate(dset = factor(dset))
glimpse(combine)
## Observations: 2,083,778
## Variables: 12
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dset <fct> train, train, train, train, train, train, t...
For our following analysis, we will turn the data and time from characters into date objects. We also recode vendor_id as a factor. This makes it easier to visualise relationships that involve these features.
library('lubridate')
train <- train %>%
mutate(pickup_datetime = ymd_hms(pickup_datetime),
dropoff_datetime = ymd_hms(dropoff_datetime),
vendor_id = factor(vendor_id),
passenger_count = factor(passenger_count))
ASSUME NOTHING! It is worth checking whether the trip_durations are consistent with the intervals between the pickup_datetime and dropoff_datetime. Presumably the former were directly computed from the latter, but you never know. Below, the check variable shows “TRUE” if the two intervals are not consistent:
train %>%
mutate(check = abs(int_length(interval(dropoff_datetime,pickup_datetime)) + trip_duration) > 0) %>%
select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
group_by(check) %>%
count()
## # A tibble: 1 x 2
## # Groups: check [1]
## check n
## <lgl> <int>
## 1 FALSE 1458644
And we find that everything fits perfectly.
“Visualizations of feature distributions and their relations are key to understanding a data set, and they often open up new lines of inquiry. I always recommend to examine the data from as many different perspectives as possible to notice even subtle trends and correlations.”
The author used a wonderful package for interative maps called leaflet. There a couple of models that you can try. We changed a little bit to try the leaflet-extras providers.
library(leaflet)
library(leaflet.extras)
set.seed(1234)
foo <- sample_n(train, 8e3)
leaflet(data = foo) %>% addProviderTiles(providers$Esri.WorldTopoMap) %>%
addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
color = "blue", fillOpacity = 0.3)
Using this visualization feature we notice the majority points of our trips (Manhattan and JFK airport).
library(ggplot2)
train %>%
ggplot(aes(trip_duration)) +
geom_histogram(fill = "red", bins = 150) +
scale_x_log10() +
scale_y_sqrt() +
labs(title = "New York Taxi - EDA", x = "Trip Duration (s)", y = "Number of Events")
We find:
train %>%
arrange(desc(trip_duration)) %>%
select(trip_duration, pickup_datetime, dropoff_datetime, everything()) %>%
head(10)
## # A tibble: 10 x 11
## trip_duration pickup_datetime dropoff_datetime id vendor_id
## <int> <dttm> <dttm> <chr> <fct>
## 1 3526282 2016-02-13 22:46:52 2016-03-25 18:18:14 id0053~ 1
## 2 2227612 2016-01-05 06:14:15 2016-01-31 01:01:07 id1325~ 1
## 3 2049578 2016-02-13 22:38:00 2016-03-08 15:57:38 id0369~ 1
## 4 1939736 2016-01-05 00:19:42 2016-01-27 11:08:38 id1864~ 1
## 5 86392 2016-02-15 23:18:06 2016-02-16 23:17:58 id1942~ 2
## 6 86391 2016-05-31 13:00:39 2016-06-01 13:00:30 id0593~ 2
## 7 86390 2016-05-06 00:00:10 2016-05-07 00:00:00 id0953~ 2
## 8 86387 2016-06-30 16:37:52 2016-07-01 16:37:39 id2837~ 2
## 9 86385 2016-06-23 16:01:45 2016-06-24 16:01:30 id1358~ 2
## 10 86379 2016-05-17 22:22:56 2016-05-18 22:22:35 id2589~ 2
## # ... with 6 more variables: passenger_count <fct>,
## # pickup_longitude <dbl>, pickup_latitude <dbl>,
## # dropoff_longitude <dbl>, dropoff_latitude <dbl>,
## # store_and_fwd_flag <chr>
Those records would correspond to 24-hour trips and beyond, with a maximum of almost 12 days. I know that rush hour can be bad, but those values are a little unbelievable.
Over the year, the distributions of pickup_datetime and dropoff_datetime look like this: mark and even a few way above it:
p1 <- train %>%
ggplot(aes(pickup_datetime)) +
geom_histogram(fill = "red", bins = 120) +
labs(x = "Pickup dates")
p2 <- train %>%
ggplot(aes(dropoff_datetime)) +
geom_histogram(fill = "blue", bins = 120) +
labs(x = "Dropoff dates")
layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)
Fairly homogeneous, covering half a year between January and July 2016. There is an interesting drop around late January early February:
train %>%
filter(pickup_datetime > ymd("2016-01-20") & pickup_datetime < ymd("2016-02-10")) %>%
ggplot(aes(pickup_datetime)) +
geom_histogram(fill = "red", bins = 120)
That’s winter in NYC, so maybe snow storms or other heavy weather? Events like this should be taken into account, maybe through some handy external data set?
In the plot above we can already see some daily and weekly modulations in the number of trips. Let’s investigate these variations together with the distributions of passenger_count and vendor_id by creating a multi-plot panel with different components:
p1 <- train %>%
group_by(passenger_count) %>%
count() %>%
ggplot(aes(passenger_count, n, fill = passenger_count)) +
geom_col() +
scale_y_sqrt() +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(vendor_id, fill = vendor_id)) +
geom_bar() +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(store_and_fwd_flag)) +
geom_bar() +
theme(legend.position = "none") +
scale_y_log10()
p4 <- train %>%
mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
group_by(wday, vendor_id) %>%
count() %>%
ggplot(aes(wday, n, colour = vendor_id)) +
geom_point(size = 4) +
labs(x = "Day of the week", y = "Total number of pickups") +
theme(legend.position = "none")
p5 <- train %>%
mutate(hpick = hour(pickup_datetime)) %>%
group_by(hpick, vendor_id) %>%
count() %>%
ggplot(aes(hpick, n, color = vendor_id)) +
geom_point(size = 4) +
labs(x = "Hour of the day", y = "Total number of pickups") +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,5),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
We find:
train %>%
group_by(passenger_count) %>%
count()
## # A tibble: 10 x 2
## # Groups: passenger_count [10]
## passenger_count n
## <fct> <int>
## 1 0 60
## 2 1 1033540
## 3 2 210318
## 4 3 59896
## 5 4 28404
## 6 5 78088
## 7 6 48333
## 8 7 3
## 9 8 1
## 10 9 1
train %>%
group_by(store_and_fwd_flag) %>%
count()
## # A tibble: 2 x 2
## # Groups: store_and_fwd_flag [2]
## store_and_fwd_flag n
## <chr> <int>
## 1 N 1450599
## 2 Y 8045
y_count <- table(train$store_and_fwd_flag)['Y']/sum(table(train$store_and_fwd_flag))
paste0('Trip data stored in memory due to no connection represents ',round(y_count*100, digits = 2),'% of the values.')
## [1] "Trip data stored in memory due to no connection represents 0.55% of the values."
The trip volume per hour of the day depends somewhat on the month and strongly on the day of the week:
p1 <- train %>%
mutate(hpick = hour(pickup_datetime),
Month = factor(month(pickup_datetime, label = TRUE))) %>%
group_by(hpick, Month) %>%
count() %>%
ggplot(aes(hpick, n, color = Month)) +
geom_line(size = 1.5) +
labs(x = "Hour of the day", y = "count")
p2 <- train %>%
mutate(hpick = hour(pickup_datetime),
wday = factor(wday(pickup_datetime, label = TRUE))) %>%
group_by(hpick, wday) %>%
count() %>%
ggplot(aes(hpick, n, color = wday)) +
geom_line(size = 1.5) +
labs(x = "Hour of the day", y = "count")
layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)
We find:
p1 <- train %>%
filter(pickup_longitude > -74.05 & pickup_longitude < -73.7) %>%
ggplot(aes(pickup_longitude)) +
geom_histogram(fill = "red", bins = 40)
p2 <- train %>%
filter(dropoff_longitude > -74.05 & dropoff_longitude < -73.7) %>%
ggplot(aes(dropoff_longitude)) +
geom_histogram(fill = "blue", bins = 40)
p3 <- train %>%
filter(pickup_latitude > 40.6 & pickup_latitude < 40.9) %>%
ggplot(aes(pickup_latitude)) +
geom_histogram(fill = "red", bins = 40)
p4 <- train %>%
filter(dropoff_latitude > 40.6 & dropoff_latitude < 40.9) %>%
ggplot(aes(dropoff_latitude)) +
geom_histogram(fill = "blue", bins = 40)
layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Here we had constrain the range of latitude and longitude values, because there are a few cases which are way outside the NYC boundaries. The resulting distributions are consistent with the focus on Manhattan that we had already seen on the map. These are the most extreme values from the pickup_latitude feature:
train %>%
arrange(pickup_latitude) %>%
select(pickup_latitude, pickup_longitude) %>%
head(5)
## # A tibble: 5 x 2
## pickup_latitude pickup_longitude
## <dbl> <dbl>
## 1 34.4 -65.8
## 2 34.7 -75.4
## 3 35.1 -71.8
## 4 35.3 -72.1
## 5 36.0 -77.4
train %>%
arrange(desc(pickup_latitude)) %>%
select(pickup_latitude, pickup_longitude) %>%
head(5)
## # A tibble: 5 x 2
## pickup_latitude pickup_longitude
## <dbl> <dbl>
## 1 51.9 -72.8
## 2 44.4 -67.0
## 3 43.9 -71.9
## 4 43.5 -74.2
## 5 43.1 -72.6
We need to keep the existence of these (rather astonishing) values in mind so that they don’t bias our analysis.
Pickup point
train %>%
select(pickup_longitude, pickup_latitude) %>%
group_by(pickup_longitude, pickup_latitude) %>%
count(sort = TRUE) %>%
summary()
## pickup_longitude pickup_latitude n
## Min. :-121.93 Min. :34.36 Min. : 1.000
## 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: 1.000
## Median : -73.98 Median :40.75 Median : 1.000
## Mean : -73.97 Mean :40.75 Mean : 1.056
## 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: 1.000
## Max. : -61.34 Max. :51.88 Max. :39.000
Dropoff point
train %>%
select(dropoff_longitude, dropoff_latitude) %>%
group_by(dropoff_longitude, dropoff_latitude) %>%
count(sort = TRUE) %>%
summary()
## dropoff_longitude dropoff_latitude n
## Min. :-121.93 Min. :32.18 Min. : 1.000
## 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: 1.000
## Median : -73.98 Median :40.75 Median : 1.000
## Mean : -73.97 Mean :40.75 Mean : 1.029
## 3rd Qu.: -73.96 3rd Qu.:40.77 3rd Qu.: 1.000
## Max. : -61.34 Max. :43.92 Max. :39.000
Now it’s time to examine how those features are related to each other and to our target trip duration.
p1 <- train %>%
mutate(day_week = wday(pickup_datetime, label = TRUE)) %>%
group_by(day_week, vendor_id) %>%
summarise(trip_duration_mean = median(trip_duration)/60) %>%
ggplot(aes(day_week, trip_duration_mean, color = vendor_id)) +
geom_point(size = 4) +
labs(x = "Day of the week",y = "Median trip duration [min]")
p2 <- train %>%
mutate(hour_day = hour(pickup_datetime)) %>%
group_by(hour_day, vendor_id) %>%
summarise(trip_duration_mean = median(trip_duration)/60) %>%
ggplot(aes(hour_day, trip_duration_mean, color = vendor_id)) +
geom_smooth(method = "loess", span = 1/2) +
geom_point(size = 4) +
labs(x = "Hour of the day",y = "Median trip duration [min]") +
theme(legend.position = "none")
layout <- matrix(c(1,2), 2,1, byrow = FALSE)
multiplot(p1,p2, layout = layout)
We find:
Are different numbers of passengers and/or the different vendors correlated with the duration of the trip?
train %>%
ggplot(aes(passenger_count, trip_duration, color = passenger_count)) +
geom_boxplot() +
scale_y_log10() +
theme(legend.position = "none") +
facet_wrap(~vendor_id) +
labs(y = "Trip duration [s]", x = "Number of passengers")
We find:
train %>%
ggplot(aes(trip_duration, fill = vendor_id)) +
geom_density(position = "stack") +
scale_x_log10()
Comparing the densities of the trip_duration distribution for the two vendors we find the medians are very similar, whereas the means are likely skewed by vendor 2 containing most of the long-duration outliers:
train %>%
group_by(vendor_id) %>%
summarise(mean_duration = mean(trip_duration),
median_duration = median(trip_duration))
## # A tibble: 2 x 3
## vendor_id mean_duration median_duration
## <fct> <dbl> <dbl>
## 1 1 845. 658.
## 2 2 1059. 666.
The original analysis compared “Store and Forward vs trip_duration”, but I presume that the analysis for “Store and Forward” will better performed if we consider the location (longitude and latitude) and the dropoff time (paying time) due to some areas have a weaker signal and also when the possibility of the network being congested.
train %>%
group_by(vendor_id, store_and_fwd_flag) %>%
count()
## # A tibble: 3 x 3
## # Groups: vendor_id, store_and_fwd_flag [3]
## vendor_id store_and_fwd_flag n
## <fct> <chr> <int>
## 1 1 N 670297
## 2 1 Y 8045
## 3 2 N 780302
As we can see above, Store and Forward appears only to vendor 1.
Does the flags Y appear with a higher frequency at a specific time of day?
train %>%
mutate(hour_day = hour(dropoff_datetime)) %>%
group_by(hour_day, store_and_fwd_flag) %>%
count() %>%
ggplot(aes(hour_day, n, color = store_and_fwd_flag)) +
geom_point(size = 4) +
scale_y_log10() +
labs(x = "Hour of the day",y = "No connection to the server occurrences") +
ggtitle("Vendor_ID 1 - Store and Forward") +
theme(plot.title = element_text(hjust = 0.5))
Flag Y follow the distribution frequency by hour of the Flag N, so we can’t determine exactly with how many occurences per hour the problem aggravates.
Let’s take a look in the position of the Store_and_fwd_flag = Y to try to find a correlation:
train %>%
filter(store_and_fwd_flag == "Y") %>%
leaflet() %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
color = "blue", fillOpacity = 0.3)
Visually the map showed no relationship between a certain area and the Store_and_flag = Y.
In this section we build new features from the existing ones, trying to find better predictors for our target variable.
The new temporal features (date,month,wday, hour) are derived from the pickup_datetime. We got the JFK and LA Guardia airport coordinates from Wikipedia. The blizzard feature is based on the external weather data.
jfk_coord <- tibble(lon = -73.778889, lat = 40.639722)
la_guardia_coord <- tibble(lon = -73.872611, lat = 40.77725)
pick_coord <- train %>%
select(pickup_longitude, pickup_latitude)
drop_coord <- train %>%
select(dropoff_longitude, dropoff_latitude)
train$dist <- distCosine(pick_coord, drop_coord)
train$bearing <- bearing(pick_coord, drop_coord)
train$jfk_dist_pick <- distCosine(pick_coord, jfk_coord)
train$jfk_dist_drop <- distCosine(drop_coord, jfk_coord)
train$lg_dist_pick <- distCosine(pick_coord, la_guardia_coord)
train$lg_dist_drop <- distCosine(drop_coord, la_guardia_coord)
train <- train %>%
mutate(speed = (dist/trip_duration)*3.6,
date = date(pickup_datetime),
month = month(pickup_datetime, label = TRUE),
wday = wday(pickup_datetime, label = TRUE),
wday = fct_relevel(wday,c("seg","ter", "qua", "qui", "sex", "sáb", "dom")),
hour = hour(pickup_datetime),
work = (hour %in% seq(8,18)) & (wday %in% c("seg","ter", "qua", "qui", "sex")),
jfk_trip = (jfk_dist_pick < 2e3) | (jfk_dist_drop < 2e3),
lg_trip = (lg_dist_pick < 2e3) | (lg_dist_drop < 2e3),
blizzard = !((date < ymd("2016-01-22") | (date > ymd("2016-01-29"))))
)
Let’s take a look in the created fields:
glimpse(train)
## Observations: 1,458,644
## Variables: 26
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <fct> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, ...
## $ dropoff_datetime <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, ...
## $ passenger_count <fct> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dist <dbl> 1500.1995, 1807.5298, 6392.2513, 1487.1625,...
## $ bearing <dbl> 99.932546, -117.063997, -159.608029, -172.7...
## $ jfk_dist_pick <dbl> 22315.02, 20258.98, 21828.54, 21461.51, 236...
## $ jfk_dist_drop <dbl> 21025.4008, 21220.9775, 20660.3899, 21068.1...
## $ lg_dist_pick <dbl> 9292.897, 10058.778, 9092.997, 13228.107, 8...
## $ lg_dist_drop <dbl> 7865.248, 11865.578, 13461.012, 14155.920, ...
## $ speed <dbl> 11.869710, 9.814641, 10.834324, 12.479686, ...
## $ date <date> 2016-03-14, 2016-06-12, 2016-01-19, 2016-0...
## $ month <ord> mar, jun, jan, abr, mar, jan, jun, mai, mai...
## $ wday <ord> seg, dom, ter, qua, sáb, sáb, sex, sáb, sex...
## $ hour <int> 17, 0, 11, 19, 13, 22, 22, 7, 23, 21, 22, 1...
## $ work <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FAL...
## $ jfk_trip <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ lg_trip <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ blizzard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
From the coordinates of the pickup and dropoff points we can calculate the direct distance (as the crow flies) between the two points, and compare it to our trip_durations. Since taxis aren’t crows (in most practical scenarios), these values correspond to the minimum possible travel distance.
To compute these distances we are using the distCosine function of the geosphere package for the spherical trigonometry. This method gives us the sortest distance between two points on a spheical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape. Here are the raw values of distance vs duration (based on a down-sized sample to speed up the kernel).
set.seed(4321)
train %>%
sample_n(5e4) %>%
ggplot(aes(dist,trip_duration)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Direct distance [m]", y = "Trip duration [s]")
We find:
Let’s filter the data a little bit to remove the extreme (and the extremely suspicious) data points, and bin the data indo a 2-d histogram.
This plot shows that in log-log space the trip_duration is increasing slower than linear for large distance values:
train %>%
filter(trip_duration < 3600 & trip_duration > 120) %>%
filter(dist > 100 & dist < 100e3) %>%
ggplot(aes(dist,trip_duration)) +
geom_bin2d(bins = c(500,500)) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Direct distance [m]", y = "Trip duration [s]")
Distance over time is of course velocity, and by computing the average apparent velocity of our taxis we will have another diagnostic to remove bogus values. Of course, we won’t be able to use speed as a predictor for our model, since it requires knowing the travel time, but it can still be helpful in cleaning up our training data and findind other features with predictive power. This is the speed distribution:
train %>%
filter(speed > 2 & speed < 1e2) %>%
ggplot(aes(speed)) +
geom_histogram(fill = "red", bins = 50) +
labs(x = "Average speed [km/h] (direct distance)")
Well, after removing the most extreme values this looks way better than I would have expected. An average speed of around 15 km/h sounds probably reasonable for NYC. Everything above 50 km/h certainly requires magical cars (or highway travel). Also keep in mind that this refers to the direct distance and that the real velocity would have been always higher.
In a similar way as the average duration per day and hour we can also investigate the average speed for these time bins:
p1 <- train %>%
group_by(vendor_id,wday) %>%
summarise(speed_median = median(speed)) %>%
ggplot(aes(wday,speed_median, color = vendor_id)) +
geom_point(size = 4) +
labs(x = "Day of the week", y = "Median speed [km/h]" )
p2 <- train %>%
group_by(vendor_id,hour) %>%
summarise(speed_median = median(speed)) %>%
ggplot(aes(hour,speed_median, color = vendor_id)) +
geom_smooth(method = "loess", span = 1/2) +
geom_point(size = 4) +
labs(x = "Hour of the day", y = "Median speed [km/h]" )
theme(legend.position = "none")
## List of 1
## $ legend.position: chr "none"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
p3 <- train %>%
group_by(wday,hour) %>%
summarise(speed_median = median(speed)) %>%
ggplot(aes(hour,wday, fill = speed_median)) +
geom_tile() +
labs(x = "Hour of the day", y = "Day of the week") +
scale_fill_distiller(palette = "Spectral")
layout <- matrix(c(1,2,3,3), 2,2, byrow = TRUE)
multiplot(p1,p2,p3, layout = layout)
We find:
In our maps (above) and trip paths (below) we noticed that a number of trips began or ended at either of the two NYC airports: JFK and La Guardia. Since airports are usually not in the city center, it’s reasonable to assume that the pickup/dropoff distance from the airport could be a useful predictor for longer trip_durations. Above, we defined the coordinates of the two airports and computed the corresponding distances:
p1 <- train %>%
ggplot(aes(jfk_dist_pick)) +
geom_histogram(fill = "red", bins = 30) +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "JFK pickup distance", y = "count" )
p2 <- train %>%
ggplot(aes(lg_dist_pick)) +
geom_histogram(fill = "red", bins = 30) +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "La Guardia pickup distance", y = "count" )
p3 <- train %>%
ggplot(aes(jfk_dist_drop)) +
geom_histogram(fill = "blue", bins = 30) +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "JFK dropoff distance", y = "count" )
p4 <- train %>%
ggplot(aes(lg_dist_drop)) +
geom_histogram(fill = "blue", bins = 30) +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "La Guardia pickup distance", y = "count" )
layout <- matrix(c(1,2,3,4), 2,2, byrow = TRUE)
multiplot(p1,p2,p3,p4,layout = layout)
Based on these numbers, we can define a JFK/La Guardia trip as having a pickup or dropoff distance of less than 2 km from the corresponding airport.
What are the trip_durations of these journeys? # Data cleaning Before we turn to the modelling it is time to clean up our training data. We have waited to do this until now to have a more complete overview of the problematic observations. The aim here is to remove trips that have improbable features, such as extreme trip duration.
While there might also be a number of bogus trip durations in the test data we shouldn’t be able to predict them in any case (unless there were some real correlations). By removing these training data values we will make our model more robust and more likely to generalise to unseen data, which is always our primary goal in machine learning.